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Abstract 


Numerous problems in machine learning are formulated as optimization with 
manifold constraints. In this paper, we propose the Manifold alternating directions 
method of multipliers (MADMM), an extension of the classical ADMM scheme 
for manifold-constrained non-smooth optimization problems and show its appli¬ 
cation to several challenging problems in dimensionality reduction, data analysis, 
and manifold learning. 


1 Introduction 

A wide range of problems in machine learning, pattern recognition, computer vision, and signal 
processing is formulated as optimization problems where the variables are constrained to lie on 
some Riemannian manifold. For example, optimization on the Grassman manifold comes up in 
multi-view clustering Ga and matrix completion 1^ . Optimization on the Stiefel manifold arises 
in a plethora of applications ranging from classical ones such as eigenvalue problems, assignment 
problems, and Procrustes problems |[7|, to more recent ones such as 1-bit compressed sensing ll^ . 
Problems involving products of Stiefel manifolds include shape correspondence (3^ . manifold learn¬ 
ing (211, sensor localization uni, structural biology ca, and structure from motion recovery o. 
Optimization on the sphere is used in principle geodesic analysis (481, a generalization of the clas¬ 
sical PC A to non-Euclidean domains. Optimization over the manifold of fixed-rank matrices arises 
in maxcut problems (28]| , sparse principal component analysis (28], regression fSEl , matrix comple¬ 
tion |[T0l|45l, and image classification (4^ . Oblique manifolds are encountered in problems such 
as independent component analysis and joint diagonalization m, blind source separation ED, and 
prediction of stock returns (26l . 


Though some instances of manifold optimization such as eigenvalues problems have been treated ex¬ 
tensively in the distant past, the first general purpose algorithms appeared only in the 1990s El HD. 
With the emergence of numerous applications during the last decade, especially in the machine 
learning community, there has been an increased interest in general-purpose optimization on differ¬ 
ent manifolds 0, leading to several manifold optimization algorithms such as conjugate gradients 
(20l . trust regions (2, and Newton (44l[4l. Boumal et al. (TTIl released the MATLAB package 
Manopt, as of today the most complete generic toolbox for smooth optimization on manifolds, in¬ 
cluding a variety of manifolds and solvers. 

In this paper, we are interested in manifold-constrained minimization of non-smooth functions. Re¬ 
cent applications of such problems include several formulations of robust PC A (Hi, the computation 
of compressed modes in Euclidean (40l and non-Euclidean E3 spaces, robust Euclidean embedding 
ca, synchronization of rotation matrices (46l , and functional correspondence (T7ll32 . 

Prior work Broadly speaking, optimization methods for non-smooth functions break into three 
classes of approaches. First, smoothing methods replace the non-differentiable objective function 
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with its smooth approximation ca. Such methods typically suffer from a tradeoff between accu¬ 
racy (how far is the smooth approximation from the original objective) and convergence speed (less 
smooth functions are usually harder to optimize). A second class of methods use sub gradients as 
a generalization of derivatives of non-differentiable functions. In the context of manifold optimiza¬ 
tion, several subgradient approaches have been proposed GllllslEIlIil. The third class of methods 
are splitting approaches. Such methods have been studied mostly for problems involving the mini¬ 
mization of matrix functions with orthogonality constraints. Lai and Osher proposed the method of 
splitting orthogonal constraints (SOC) based on the Bregman iteration El. Neumann et al. m 
used a different splitting scheme for the same class of problems. 


Contributions In this paper, we propose Manifold alternating direction method of multipliers 
(MADMM), an extension of the classical ADMM scheme 1^ for manifold-constrained non-smooth 
optimization problems. The core idea is a splitting into a smooth problem with manifold constraints 
and a non-smooth unconstrained optimization problem. Our method has a number of advantages 
common to ADMM approaches. First, it is very simple to grasp and implement. Second, it is 
generic and not limited to a specific manifold, as opposed e.g. to O [37l developed for the Stiefel 
manifold, or ED developed for the oblique manifold. Third, it makes very few assumptions about 
the properties of the objective function. Fourth, in some settings, our method lends itself to paral¬ 
lelization on distributed computational architectures (121 . Finally, our method demonstrates faster 
convergence than previous methods in a broad range of applications. 


2 Manifold optimization 

The term manifold- or manifold-constrained optimization refers to a class of problems of the form 


min /(X), 


( 1 ) 


where / is a smooth real-valued function, X is an m x n real matrix, and M is some Riemannian 
submanifold of The manifold is not a vector space and has no global system of coordinates, 

however, locally at point X, the manifold is homeomorphic to a Euclidean space referred to as the 
tangent space TxM.- 


The main idea of manifold optimization is to treat the objective as a function f : M ^ R defined 
on the manifold, and perform descent on the manifold itself rather than in the ambient Euclidean 
space. On a manifold, the intrinsic (Riemannian) gradient of / ^1 point X is a vector 

in the tangent space TxM that can be obtained by projecting the standard (Euclidean) gradient 
V/(X) onto TxM by means of a projection operator Px (see an illustration below). A step along 
the intrinsic gradient direction is performed in the tangent plane. In order to obtain the next iterate, 
the point in the tangent plane is mapped back to the manifold by means of a retraction operator Rx , 
which is typically an approximation of the exponential map. Eor many manifolds, the projection P 
and retraction R operators have a closed form expression. 


A conceptual gradient descent-like manifold optimization is presented in Algorithmic Eor a com¬ 
prehensive introduction to manifold optimization, the reader is referred to El 


repeat 

Compute the extrinsic gradient V/(X*^^^) 
Projection: Vai/(X('=)) = 

Compute the step size along the descent 
direction 

Retraction: (—V>t/(X^^^)) 

until convergence'. 

Algorithm 1: Conceptual algorithm for smooth opti¬ 
mization on manifold M . 
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3 Manifold ADMM 


Let us now consider general problems of the form 

min f{X)+g{AX), (2) 

xeM 

where / and g are smooth and non-smooth real-valued functions, respectively, Ais 3.k x m matrix, 
and the rest of the notation is as in problem Q. Examples of g often used in machine learning 
applications are nuclear-, Li-, or I/ 2 ,i-norms. Because of non-smoothness of the objective function. 
Algorithm [2 cannot be used directly to minimize (0. 

In this paper, we propose treating this class of problems using the alternating directions method of 
multipliers (ADMM). The key idea is that problem ^ can be equivalently formulated as 

min f(X)^g(Z) s.t. Z = AA (3) 

xeM,zeR^^'^ 

by introducing an artificial variable Z and a linear constraint. The method of multipliers 12511421 . 
applied to only the linear constraints in ([^, leads to the minimization problem 

min f{X)+g{Z) + l\\AX-Z + U\\l (4) 

xeM,zeR^^^ 

where p > 0 and U G have to be chosen and updated appropriately (see below). This formu¬ 
lation now allows splitting the problem into two optimization sub-problems w.r.t. to X and Z, which 
are solved in an alternating manner, followed by an updating of U and, if necessary, of p. Observe 
that in the first sub-problem w.r.t. X we minimize a smooth function with manifold constraints, 
and in the second sub-problem w.r.t. Z we minimize a non-smooth function without manifold con¬ 
straints. Thus, the problem breaks down into two well-known sub-problems. This method, which 
we call Manifold alternating direction method of multipliers (MADMM), is summarized in Algo- 
rithm|2l 


Initialize k ^ = X^^\ = 0. 

repeat 

A-step: A(^+i) = argmin^^^ /(A) + f ||AA - -h 
Z-step: =argmin^^^(Z) + f ||AA^^+i) - Z + ||2 

Update + AA^^+i) - Z^^+i) and A: ^ A: + 1 

until convergence'. 

Algorithm 2: Generic MADMM method for non-smooth optimization on manifold M. 

Note that MADMM is extremely simple and easy to implement. The A-step is the setting of Algo¬ 
rithmic and can be carried out using any standard smooth manifold optimization method. Similarly 
to common implementation of ADMM algorithms, there is no need to solve the A-step problem 
exactly', instead, only a few iterations of manifold optimization are done. Furthermore, for some 
manifolds and some functions /, the A-step has a closed-form solution. The implementation of the 
Z-step depends on the non-smooth function g, and in many cases has a closed-form expression: for 
example, when g is the Li-norm, the Z-step boils down to simple shrinkage, and when g is nuclear 
norm, the Z-step is performed by singular value shrinkage. pO is the only parameter of the algorithm 
and its choice is not critical for convergence. In our experiments, we used a rather arbitrary fixed 
value of p, though in the ADMM literature it is common to adapt p at each iteration, e.g. using the 
strategy described in m. 

4 Results and Applications 

In this section, we show experimental results providing a numerical evaluation of our approach on 
several challenging applications from the domains of dimensionality reduction, data analysis, pattern 
recognition, and manifold learning. All our experiments were implemented in MATLAB; we used 
the conjugate gradients and trust regions solvers from the Manopt toolbox ifTTl for the A-step. Time 
measurements were carried out on a PC with Intel Xeon 2.4 GHz CPU. 
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4.1 Compressed modes 


Problem setting Our first application is the computation of compressed modes, an approach for 
constructing localized Fourier-like orthonormal bases recently introduced in 1401 . Let us be given a 
manifold S with a Laplacian A, where in this context, ‘manifold’ can refer to both continuous or dis¬ 
cretized manifolds of any dimension, represented as graphs, triangular meshes, etc., and should not 
be confused with the matrix manifolds we have discussed so far referring to manifold-constrained 
optimization problems. Here, we assume that the manifold is sampled at n points and the Laplacian 
is represented as an n x n sparse symmetric matrix. In many machine learning applications such as 
spectral clustering lf38]| , non-linear dimensionality reduction, and manifold learning O, one is inter¬ 
ested in finding the first k eigenvectors of the Laplacian = ^A, where ^ is the n x k matrix of 
the first eigenvectors arranged as columns, and A is the diagonal k x k matrix of the corresponding 
eigenvalues. 

The first k eigenvectors of the Laplacian can be computed by minimizing the Dirichlet energy 

min trace(^^A^) s.t. = / (5) 

with orthonormality constraints. Laplacian eigenfunctions form an orthonormal basis on the Hilbert 
space L‘^{S) with the standard inner product, and are a generalization of the Fourier basis to non- 
Euclidean domains. The main disadvantage of such bases is that its elements are globally supported. 
Ozolips et al. 1401 proposed a construction of localized quasi-eigenbases by solving 

min trace(^^ A^) +/ill^lli s.t. (6) 

where /i > 0 is a parameter. The Li-norm (inducing sparsity of the resulting basis) together with 
the Dirichlet energy (imposing smoothness of the basis functions) lead to orthogonal basis functions, 
referred to as compressed modes that are localized and approximately diagonalize A. Lai and Osher 
El and Neumann et al. ITTl proposed two different splitting methods for solving problem ([^. Due 
to lack of space, the reader is referred to EH 113 for ^ detailed description of both methods. 

Solution Here, we realize that problem I© is an instance of manifold optimization on the Stiefel 
manifold §(n, k) = {X G X = 1} and solve it using MADMM, which assumes in 

this setting the form of Algorithmic The X-step involves optimization of a smooth function on the 
Stiefel manifold and can be carried out using standard manifold optimization algorithms; we use 
conjugate gradients and trust regions solvers. The Z-step requires the minimization of the sum of 
Li- and L 2 -norms, a standard problem in signal processing that has an explicit solution by means of 
thresholding (using the shrinking operator). In all our experiments, we used the parameter p = 2 for 
MADMM. For comparison with the method of EH. we used the code provided by the authors. The 
method of iJTl we implemented ourselves. All the methods were initialized by the same random 
orthonormal n x k matrix. 


Input n X n Laplacian matrix A, parameter p > 0 
Output k first compressed modes of A 

Initialize k ^ 1, ^random orthonormal matrix, = 0. 

repeat 

^(/c+i) _ argmin^^g('^^;.) trace(^^A^) + ^||^ — -h 
Z{k+i) =sign($(^+i) +f/(^))max{0, 

jjik+i) ^ jj(k) ^ ^(/c+i) _ ^(/c+i) and A: ^ A: + 1 
until convergence’. 

Algorithm 3: MADMM method for computing compressed modes. 

Results To study the behavior of ADMM, we used a simple ID problem with a Euclidean Lapla¬ 
cian constructed on a line graph with n vertices. Eigure (top left) shows the convergence of 
MADMM with different random initializations. We observe that the method converges globally 
independently of the initialization. Eigure (top right) shows the convergence of MADMM using 
different solvers and number of iterations in the X-step. We did not observe any significant change 
in the behavior. Eigure(bottom left) studies the scalability of different algorithms, speaking clearly 
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Figure 1: The first six compressed modes of a human mesh containing n = 8K points computed 
using MADMM. Parameter fi = 10“^ and three manifold optimization iterations in X-step were 
used in this experiment. 






Figure 2: Compressed modes problem. Top left: convergence of MADMM on a problem of size 
n = 500, /c = 10 with different random initialization. Top right: convergence of MADMM using 
different solvers and number of iterations at X-step on the same problem. Bottom left: scalability of 
different methods; shown is time/iteration on a problem of different size (fixed k = 10 and varying 
n). Bottom right: comparison of convergence of different splitting methods and MADMM on a 
problem of size n = 8K. 


in favor of MADMM compared to the methods of 1341 [JTl . Figurel^shows compressed modes com¬ 
puted on a triangular mesh of a human sampled at 8K vertices, using the cotangent weights formula 
ED to discretize the Laplacian. Figure [^(bottom right) shows the convergence of different methods 
on this dataset. MADMM shows the best performance among the compared methods. 

4.2 Functional correspondence 

Problem setting Our second problem is coupled diagonalization, which is used for finding func¬ 
tional correspondence between manifolds l3^ and multi-view clustering 1^ . Let us consider a 
collection of L manifolds each discretized at rii points and equipped with a Laplacian 

represented as an rii x rii matrix. Tht functional correspondence between manifolds Si and Sj is an 
rij X rii matrix Tij mapping functions from Lf{Si) to Lf{Sj). It can be efficiently approximated us¬ 
ing the first k Laplacian eigenvectors as Tij ^ ^jXij^J, where Xij is the k x k matrix translating 
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Figure 3: Functional correspondence problem. Left: evaluation of the functional correspondence 
obtained using MADMM and least squares. Shown in the percentage of correspondences falling 
within a geodesic ball of increasing radius w.r.t. the groundtruth correspondence. Right: conver¬ 
gence of MADMM and smoothing method for various values of the smoothing parameter. 


Fourier coefficients from basis to basis represented sls rii x k and rij x k matrices, respec¬ 
tively. Imposing a further assumption that Tij is volume-preserving, Xij must be an orthonormal 
matrix 1^ , and thus can be represented as a product of two orthonormal matrices Xij = XiXj. 
For each pair of manifolds Si^Sj, we assume to be given a set of Qij functions in L‘^{Si) arranged 
as columns of an rii x qij matrix Fij and the corresponding functions in L‘^{Sj) represented by the 
rij X Qij matrix Gij . The correspondence between all the manifolds can be established by solving 
the problem 


min 




L 

i=l 


S.t. 


X^^Xi = I Vi = l,...,L, (7) 


T /O 

where off(A) = Y^i^j afj, and the use of the I/ 2 ,i-norm || A|| 2 ,i = ^fj) allows to cope 

with outliers in the correspondence data 11611221. The problem can be interpreted as simultaneous 
diagonalization of the Laplacians Ai,..., Aj^ 1211 . As correspondence data F, G, one can use 
point-wise correspondence between some known ‘seeds’, or, in computer graphics applications, 
some shape descriptors 1^ . Geometrically, the matrices Xi can be interpreted as rotations of the 
respective bases, and the problem tries to achieve a coupling between the bases = ^iXi while 
making sure that they approximately diagonalize the respective Laplacians. 


Solution Here, we consider problem 0 as optimization on a product of L Stiefel manifolds, 
(Xi,..., X]^) G S^(k, k) and solve it using the MADMM method. The X-step of MADMM in our 
experiments was performed using four iterations of the manifold conjugate gradients solver. As in 
the previous problem, the Z-step boils down to simple shrinkage. We used p = 1 and initialized all 
Xi = /. 


Results We computed functional correspondences between L = 6 human 3D shapes from the 
TOSCA dataset O using k = 2b basis functions and g = 25 seeds as correspondence data, 
contaminated by 16% outliers. Figure (left) analyzes the resulting correspondence quality using 
the Princeton protocol ||30l, plotting the percentage of correspondences falling within a geodesic 
ball of increasing radius w.r.t. the groundtruth correspondence. For comparison, we show the results 
of a least-squares solution used in (see Figure |^. Figure (right) shows the convergence 
of MADMM in a correspondence problem with L = 2 shapes. For comparison, we show the 

convergence of a smoothed version of the I/ 2 ,i-norm ||A|| 2 ,i ~ ^j in (|^ for 

various values of the smoothing parameter e. 
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Figure 4: Examples of correspondences obtained with MADMM (top) and least-squares solution 
(bottom). Similar colors encode corresponding points. Bottom left: examples of correspondence 
between a pair of shapes (outliers are shown in red). 


4.3 Robust Euclidean embedding 

Problem setting Our third problem is an Li formulation of the multidimensional scaling (MDS) 
problem treated in ca under the name robust Euclidean embedding (REE). Let us be given an 
n X n matrix D of squared distances. The goal is to find a /c-dimensional configuration of points 
X G such that the Euclidean distances between them are as close as possible to the given 

ones. The classical MDS approach employs the duality between Euclidean distance matrices and 
Gram matrices: a squared Euclidean distance matrix D can be converted into a similarity matrix by 
means of double-centering B = where H = /—^11^. Conversely, the squared distance 

matrix is obtained from B by {dist{B))ij = bu + bjj — 2bij. The similarity matrix corresponding 
to a Euclidean distance matrix is positive semi-definite and can be represented as a Gram matrix 
B = XX~^, where X is the desired embedding. In the case when D is not Euclidean, B acts 
as a low-rank approximation of the similarity matrix (now not necessarily positive semi-definite) 
associated with D, leading to the problem 

min \\HDH - XX'^Wl (8) 

XeRrnXk 

known as classical MDS or classical scaling, which has a closed form solution by means of eigen- 
decomposition of HDH. 

The main disadvantage of classical MDS is the fact that noise in a single entry of the distance 
matrix D is spread over entire column/row by the double centering transformation. To cope with 
this problem, Cayton and Dasgupta lua proposed an Li version of the problem, 

min \\D — dist(5)||i s.t. B y 0, rank(5) < k, (9) 

where the use of the Li-norm efficiently rejects outliers. The authors proposed two solutions for 
problem a semi-definite programming (SDP) formulation and a subgradient descent algorithm 
(the reader is referred to ina for a detailed description of both methods). 

Solution Here, we consider © as a non-smooth optimization of the form © on the manifold of 
fixed-rank positive semi-definite matrices and solve it using MADMM. Note that in this case, we 
have only the non-smooth function g and / = 0. The X-step of the MADMM algorithm is manifold 
optimization of a quadratic function, carried out using two iterations of manifold conjugate gradients 
solver. The Z-step is performed by shrinkage. In our experiments, all the compared methods were 
initialized with the classical MDS solution and the value p = 10 was used for MADMM. SDP 
approach was implemented using MATLAB CVX toolbox ||24l. 
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Figure 5: Embedding of the noisy distances between 500 US cities in the plane using classical MDS 
(blue) and REE solved using MADMM (red). The distance matrix was contaminated by sparse noise 
by doubling the distance between some cities. Note how classical MDS is sensitive to outliers. 




Figure 6: REE problem. Left: scalability of different algorithms; shown is single iteration com¬ 
plexity as functions of the problem size n using random distance data. SDP did not scale beyond 
n = 100. Right: example of convergence of MADMM and subgradient algorithm of HU on the 
US cities problem of size n = 500. The subgradient algorithm is very sensitive to the choice of the 
initial step size c; choosing too large c breaks the convergence. 


Results Figurej^ shows an example of 2D Euclidean embedding of the distances between 500 US 
cities, contaminated by sparse noise (the distance between two cities was doubled, as in a similar 
experiment in flSl ). The robust embedding is insensitive to such outliers, while the classical MDS 
result is completely ruined. Figure (right) shows an example of convergence of the proposed 
MADMM method and the subgradient descent of Ha on the same dataset. We observed that our 
algorithm outperforms the subgradient method in terms of convergence speed. Furthermore, the 
subgradient method appears to be very sensitive to the initial step size c; choosing too small a step 
leads to slower convergence, and if the step is too large the algorithm may fail to converge. Figure 
(left) studies the scalability of the subgradient-, SDP-, and MADMM-based solutions for the REE 
problem, plotting the complexity of a single iteration as function of the problem size on random 
data. The typical number of iterations was of the order of 20 for SDP, 50 for MADMM, and 500 for 
the subgradient method. We see that MADMM scaled better than other approaches, and that SDP is 
not applicable to large problems. 
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5 Discussion and Conclusions 


We presented MADMM, a generic algorithm for optimization of non-smooth functions with man¬ 
ifold constraints, and showed that it can be efficiently used in many important problems from the 
domains of machine learning, computer vision and pattern recognition, and data analysis. Among 
the key advantages of our method is its remarkable simplicity and lack of parameters to tune - in 
all our experiments, it worked entirely out-of-the-box. We believe that MADMM will be very use¬ 
ful in many other applications in this community. In our experiments, we observed that MADMM 
converged independently of the initialization; a theoretical study of convergence properties is an 
important future direction. 
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